(57) This invention relates to the detection of security problems in a computer network or on any computer within said network. To detect outsiders trying to break into a computer system (e.g. via the net) and/or to detect insiders misusing the privileges they have received (e.g. someone internal reading confidential data that he/she is not entitled to), the invention uses a behavior-based approach for a pattern-oriented intrusion detection system. Employing a novel algorithm, the Teiresias algorithm not used before for intrusion detection, the system represents the normal behavior of a process (103) by a pattern table (135), a pattern being a subsequence of audit events or system calls or the like. During real operation, a pattern match (133) of the event stream generated on behalf of the actual process examined (123) with the entries in the pattern table (135) is tried. Sequences of unmatched events are a deviation from the normal behavior. Such a deviation indicates that an intrusion may be taking place, which can thus raise an alarm (136) to single out, stop, or control in any other way the intrusion.
Description
Technical Field

[0001] This invention relates to intrusion detection, i.e. the detection of security problems in a computer network or on any computer within said network. It is particularly suited to detect outsiders trying to break into a computer system (e.g. via the net) and/or to detect insiders misusing the privileges they have received (e.g. someone internal reading confidential data that he/she is not entitled to). In brief, the invention uses a behavior-based approach for a pattern-oriented intrusion detection system.

Background of the Invention 

[0002] Generally, an intrusion detection system dynamically monitors actions that are taken in a given environment and decides whether these actions are symptomatic of an attack or constitute a legitimate use of the environment.

[0003] Essentially, two main intrusion detection methods are known. The first method uses the knowledge accumulated about attacks and looks for evidence of their exploitation. This method is referred to as knowledge-based. The second method builds a reference model of the usual behavior of the system being monitored and looks for deviations from the observed usage. This method is referred to as behavior-based.
[0004] In the knowledge-based approach, the underlying assumption is that the system knows all possible attacks. There is some kind of a signature for each attack and the intrusion detection system searches for these signatures when monitoring the traffic. E.g., one may monitor the audit trails on a given machine, the packets going onto the net, etc. This first approach is addressed and described by Gligor et al. in US patent 5 278 901, which also gives a good overview of the technology. An advantage of this method is that no or only few false alarms are generated, i.e. the false alarm rate is low; the main disadvantage is that only those attacks can be located that are already known. Any newly developed intrusion attack would usually remain undetected since its signature is still unknown and thus the system does not search for it.
[0005] Unfortunately, there are nowadays so many attacks that the set of signatures is growing very fast. Also, some signatures are difficult to express and an algorithm to search for them can be rather time-consuming. Nevertheless, this approach has proven its usefulness and there are products using this approach available on the market: NetRanger by Cisco Systems, Inc., and RealSecure by Internet Security Systems, Inc., are two examples of such available products.
[0006] The second, the behavior-based, approach starts from the assumption that if an attack is carried out against a system, its "behavior" will change. Therefore, the approach is to define a kind of normal profile of a system and watch for any deviation from this defined normal profile. Different techniques can be applied (e.g. statistics, rule-based systems, neural networks) using different targets (e.g. the users of the system, the performances of the network, the CPU cycles, etc.). The main advantage of this method over the knowledge-based one is that the attacks do not need to be known in advance, i.e. that unknown attacks can be detected. Thus, the detection remains up-to-date without having to update some database of known signatures. But there are disadvantages: deviations can occur without any attack (e.g. changes in the activity of the user, new software installed, new machines, new users, etc.). Therefore, all known efforts in this direction have been facing a rather high rate of false alarms. There appears to be only one product on the market using this approach: CMDS by Science Applications International Corporation.

[0007] In "A Sense of Self for Unix Processes" by S. Forrest et al., Proceedings of the 1996 IEEE Symposium on Security and Privacy, pp. 120-128, Oakland, California, May 1996, it is described how to model the behavior of the "sendmail daemon", i.e. a program running permanently in the background without user interaction, using the sequences of system calls that this program generates while running. The idea is to build a table of all the sequences of a given fixed length (here 5, 6, and 11) of consecutive system calls that could be found when watching such a sendmail daemon running. The claim was that if one tries to take advantage of a vulnerability in the sendmail code, then this would generate a sequence of system calls not found in a "normal" table, i.e. a table generated from a sample with normal behavior. However, when experimenting with this approach, one discovers that the table necessary can become fairly large. It must be stressed that all the sequences of system calls in this table have the same length, i.e. lengths of 5, 6, and 11. It has been shown in "Fixed vs. Variable-Length Patterns for Detecting Suspicious Process Behavior" by H. Debar, M. Dacier, M. Nassehi, and A. Wespi, Proceedings of ESORICS 98, Louvain-la-Neuve, Belgium, September 1998, that when trying to find what the best length for the sequences is ("best" meaning producing the shortest table of patterns while covering all possible sequences) the result is that the "best" length is 1. This means that the system does not search for unseen sequences but for unseen system calls. The consequence is that if an attack does not use any unseen system call it will not be detected. This is generally unacceptable since it may be possible to run an attack without using a previously unseen system call.
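The fixed-length scheme discussed in this paragraph can be illustrated with a minimal Python sketch. It is not taken from the cited papers; the function names `build_ngram_table` and `unseen_windows` and the toy system call traces are assumptions made purely for illustration.

```python
def build_ngram_table(traces, n):
    """Collect every window of n consecutive calls seen in the training traces."""
    table = set()
    for trace in traces:
        for i in range(len(trace) - n + 1):
            table.add(tuple(trace[i:i + n]))
    return table


def unseen_windows(trace, table, n):
    """Return the windows of length n in `trace` that are absent from `table`."""
    windows = (tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))
    return [w for w in windows if w not in table]


# Hypothetical "normal" traces of system call names.
normal = [["open", "read", "mmap", "close"],
          ["open", "mmap", "read", "close"]]
suspect = ["close", "open", "read"]

table1 = build_ngram_table(normal, 1)  # length 1: only unseen calls are detectable
table2 = build_ngram_table(normal, 2)  # length 2: unseen orderings are detectable
```

With length 1 the suspect trace looks normal because every call was seen before, whereas with length 2 the unseen ordering of `close` followed by `open` is reported.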

[0008] There are two classes of information sources for intrusion detection systems as described in "Towards a Taxonomy of Intrusion Detection Systems" by H. Debar, M. Dacier, and A. Wespi, IBM Research Report 3030, June 1998. Based on the location from where the information can be retrieved, it is differentiated between host-based and network-based intrusion detection systems. Examples of host-based information sources are the so-called C2 audit trails, the syslog files known in the UNIX operating system, or the event logs in Windows NT. Network-based information is mainly retrieved by analyzing the network packets.
[0009] As will be described in detail further below, the 
present invention relates to behavior-based intrusion 
detection using host-based information sources. For a 
given process, the intrusion detection system decides 
whether the process behavior can be judged as normal 
or abnormal. Abnormal behavior is an indication of an 
intrusion. 

[0010] As mentioned above, Forrest et al. describe in "A Sense of Self for UNIX Processes", Proceedings of the 1996 IEEE Symposium on Security and Privacy, pp. 120-128, Oakland, California, May 1996, a process model using a set of fixed-length patterns. These patterns correspond to all the possible patterns that can be found in the event sequences recorded during the training phase.

[0011] This poses a problem since a careful look at the sequences of audit events that can be generated by the so-called ftp daemon running under AIX shows that there are very long subsequences which repeat frequently. For example, many process instantiations start with an identical subsequence that has a length of 40 audit events. Thus, since the described fixed-length approach does not consider such a characteristic, any result of an intrusion detection method based on such a fixed-length approach is distorted and certain intrusions and/or misuses cannot be detected.
[0012] In "Intrusion Detection via System Call Traces" by A.P. Kosoresow and S.A. Hofmeyr, IEEE Software, pp. 35-42, Sept./Oct. 1997, it is shown that variable-length patterns can be used to model the normal behavior of a process. However, the patterns presented in this publication were constructed manually due to the lack of an automated method. It is obvious that such a manual selection or design of the patterns is inadequate for an automatic intrusion detection of the kind here approached.

[0013] A significantly different approach for a pattern-oriented intrusion detection system is disclosed in US Patent 5 278 901 to Shieh et al. It shows an intrusion detection system based on object privilege and information flow, i.e. does not use the deviation from a "typical" activity profile as described above. The approach by Shieh et al. is a knowledge-based intrusion detection system which is in contrast to the behavior-based approach of the present invention. Furthermore, the Shieh patent covers mainly the problem of detecting violations against previously defined access control policies while the present invention aims at detecting any type of attacks. The complexity of the solution chosen in the Shieh patent, however, makes this approach unsuitable to solve the problems which the present invention addresses.

[0014] To summarize, it is an object of the present invention to provide a simple and reliable method and apparatus for the detection of intrusions into a computer system, based on event patterns and particularly directed to detect deviations from a "normal" process behavior, and thus to detect attacks performed against said process. A more specific object is to generate, preferably automatically, so-to-speak "natural" patterns for the description of the process behavior and thus produce a very condensed resulting pattern table. Another specific object is to allow the use of highly efficient pattern matching algorithms, especially by producing a relatively small pattern matching table. A further specific object is to produce a pattern table with most representative patterns, independent of their length. A still further specific object is to produce a pattern table with fewer but longer entries than tables obtained with known approaches and thus improve the detection of attacks. A still further object is to define rules that specify when a deviation from the normal behavior is significant enough to raise an alarm.

Summary of the Invention 


[0015] It appears that the idea of having sequences investigated is very important, but that building fixed-length sequences is leading to unsatisfactory results, at least when these sequences do not exceed a certain minimal length. Therefore, the invention uses a new approach by focusing on a novel algorithm, the Teiresias algorithm, as described by I. Rigoutsos and A. Floratos in "Combinatorial Pattern Discovery in Biological Sequences - The TEIRESIAS Algorithm" in Bioinformatics, pp. 55-67, Vol. 14, No. 1, 1998. This algorithm is also subject of a pending US patent application, serial number 023756 (YO 9-97-176), filed 13 February 1998, not yet published.

[0016] The Teiresias algorithm, developed for a different purpose and never considered for intrusion detection, is used to search for patterns, i.e. all the subsequences that appear at least twice in a set of input sequences. Though there are other algorithms besides Teiresias that solve the problem of discovering all patterns, none of them is as efficient and fast as the Teiresias algorithm. Generally speaking, the results achieved with the Teiresias algorithm are far superior to anything produced by the prior art approaches.
[0017] A particular advantage is that, using the Teiresias algorithm, the longest patterns in a set of input sequences can be found. This is important since a table of long patterns appears to be more "representative" of a specific process than a table of short patterns. Since longer patterns usually contain more context information, it appears that they are more significant for a process than short patterns. On the other hand, short patterns are not necessarily unique for a specific process, but may appear in other processes. It is even possible that short patterns are part of an attack. The longer a pattern is, the lower is the probability that this pattern is part of other processes or even an attack. Consequently, it was found that there are attacks that can be detected with the new technique according to the present invention, which attacks remained undetected with other techniques.

[0018] A further advantage is that, obviously, a small pattern table allows efficient pattern matching algorithms to be implemented and still works reasonably fast. Both advantages lead to an improved detection of attacks.

[0019] To summarize, the present invention provides 
a method and a system for reliably detecting intrusion 
patterns, thereby minimizing the probability of false 
alarms. 

[0020] The method and apparatus for an intrusion detection system according to the invention, using the described variable-length approach when investigating event patterns, operates in two modes, a training mode and an operation mode.

[0021] In the training mode, generally speaking, the behavior of a process is defined based on the system events it generates. System events are either the system calls that are invoked by the process or the audit events generated on behalf of the process. The process model is a table of patterns, i.e. sequences or subsequences of events, which are representative of the process examined. To get a complete picture of the process, it is important that as many different event sequences as possible are generated and analyzed.
[0022] In this training mode, variable-length patterns are retrieved from the event sequences generated by or on behalf of the process. All events of a specific type generated from the invocation of the process until its end constitute an event sequence. Different process invocations may result in different event sequences. Patterns are subsequences of the event sequences; patterns that are characteristic for the process are stored in a pattern table. The pattern table represents the process model.

[0023] In the operation mode of the present invention, it is decided whether the event streams created on behalf of the process can be matched by the patterns in the pattern table, which corresponds to a normal process behavior, or whether there are subsequences of unmatched events. Unmatched events represent a deviation from the normal behavior and may thus indicate an intrusion or misuse, called an attack. Significant deviations result in raising an alarm.
[0024] As already mentioned, the present invention is advantageous because patterns are generated that are "natural" for the description of the process behavior. The use of variable-length patterns to build the set of representative patterns results in a pattern table with fewer, but longer entries than tables obtained with other approaches. As explained, longer patterns contain more context information and are therefore more representative for a particular process than short patterns. Furthermore, a small pattern table increases the speed of the detection process. It is obvious that when looking for a pattern that matches part of a given sequence, searching in a small set of patterns is faster than in a large set. Therefore, small pattern tables speed up the pattern matching process.

Brief Description of the Drawings

[0025] The foregoing and other features and advantages of the invention will be apparent from the following more detailed description of a preferred embodiment of the invention, as illustrated in the accompanying drawings, in which:

Figure 1 shows the components of an intrusion detection system according to the invention, based on the analysis of event patterns;

Figure 2 is a sample output of the Event Recording component;

Figure 3 is a sample output of the Process Filtering component;

Figure 4 is a sample output of the Translation component;

Figure 5 is a sample output of the Reduction and Aggregation component;

Figure 6 shows the patterns as detected by the Teiresias algorithm for a set of sample input strings;

Figure 7 is an illustration of the pattern reduction algorithm applied to the sample string set introduced in Figure 6;

Figure 8 is an illustration of the pattern matching algorithm for a case where the input string can be covered; and

Figure 9 is an illustration of the pattern matching algorithm for a case where the input string cannot be completely covered.



Detailed Description of the Invention

[0026] In the following first section, the components of the intrusion detection system are described. In a subsequent second section, the algorithms used in selected components are discussed.
1. The Intrusion Detection System and its Components

[0027] Figure 1 shows the components of the intrusion detection system. The system consists of two parts: an off-line part and an on-line part. The off-line part represents the training phase or mode, and the on-line part the real operation or operation mode. In the training mode, a model of the normal behavior of the process examined is generated. In the operation mode, the instantiations of the process under the observation of the intrusion detection system are compared to the process model and, if a significant deviation is observed, an alarm 136 is raised.
[0028] A process execution 103 can trigger different types of events 104. Either one of the following two event sources can be used for the present invention:

• C2 audit events as they are recorded by the auditing system available on most UNIX and some other operating systems.
• System calls as they are recorded by programs coming with the operating system, e.g. strace, or by other system utilities.

[0029] It has to be noted that the two sources cannot 
be used interchangeably. Either audit events or system 
calls have to be considered. Any further use of the term 
event in this document may relate to audit events as 
well as system calls. 

[0030] In the off-line part, it is possible to influence the process invocation in order to exercise as many different process execution paths as possible. For this purpose, we use the functional verification tests (FVT) as they are used by software developers to test all the different subcommands that can be executed by a process.

[0031] Other approaches would be to define manually a set of subcommands that are expected to cover all the process execution paths, or to just record the events of the process running in a real environment.
[0032] The events generated on behalf of a process are recorded by an event recording component 105. Event recording component 105 may not only record events by the process examined but also by other processes in the system. E.g. the auditing system only allows audit events to be collected on a system level, i.e. either for all processes or for none. An event is described by several attributes, e.g. the process name, the event name, the process id, the parent process id, or the user id.

[0033] Event recording component 105 forwards the events to training system 102. Events are forwarded as triples comprising process name, event name, and process id, labeled 106. Figure 2 shows a sample of the events that are sent from event recording component 105 to a filtering component 107 in training system 102. Filtering component 107 first groups together the events belonging to the same process by keeping the chronological order of the events. All events belonging to the same process are called an event sequence. An event sequence consists of a unique identifier and a list of events. The events are given as tuples comprising the process and the event name. Not all event sequences are needed for the further processing. Only those sequences which are needed to analyze the behavior of the examined process are forwarded to a translation component 109.

[0034] Figure 3 shows the events of Figure 2 after applying the filtering component 107.
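As a rough illustration of the grouping and filtering described for filtering component 107, consider the following Python sketch. The function name `group_and_filter`, the sample events, and the rule "keep a sequence if any of its events carries the examined process name" are assumptions for illustration only; the description does not specify the selection rule in this detail.

```python
def group_and_filter(events, target):
    """Group (process name, event name, process id) triples by process id,
    keeping chronological order, then keep only sequences of the target process."""
    sequences = {}
    for name, event, pid in events:
        # The process id serves as the unique identifier of the event sequence.
        sequences.setdefault(pid, []).append((name, event))
    return {pid: seq for pid, seq in sequences.items()
            if any(name == target for name, _ in seq)}


# Hypothetical recorded events, interleaved across two processes.
recorded = [("ftpd", "open", 12), ("cron", "read", 7),
            ("ftpd", "read", 12), ("ftpd", "close", 12)]
ftp_sequences = group_and_filter(recorded, "ftpd")
```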
[0035] For the further internal processing, translation component 109 translates the event sequences into an internal data representation. Each event, i.e. the tuple consisting of the process and the event name, is translated into a single character of an alphabet $\Sigma$. The output of translation component 109 are strings of characters labeled 110.

[0036] The translation is bijective, i.e. identical events are translated into the same character, and a character is the translation of identical events only. The translation rules are generated on the fly and stored in a translation table 134. Figure 4 shows the result of the translation of the event sequences into strings.
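A translation table that is built on the fly, as described above, might look like the following Python sketch. The class name `Translator`, the choice of uppercase letters as alphabet, and the sample events are hypothetical and not part of the description.

```python
import string


class Translator:
    """Assigns each distinct (process, event) tuple its own character on first sight."""

    def __init__(self, alphabet=string.ascii_uppercase):
        self.alphabet = alphabet
        self.table = {}  # translation table: event tuple -> character

    def translate(self, events):
        chars = []
        for event in events:
            if event not in self.table:
                if len(self.table) >= len(self.alphabet):
                    raise ValueError("alphabet exhausted")
                # Next unused character keeps the mapping one-to-one.
                self.table[event] = self.alphabet[len(self.table)]
            chars.append(self.table[event])
        return "".join(chars)


translator = Translator()
encoded = translator.translate([("ftpd", "open"), ("ftpd", "read"), ("ftpd", "open")])
```

Because every new event receives the next unused character, identical events always map to the same character and no character is shared by two different events.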
[0037] The strings can be reduced further. A reduction and aggregation component 111 performs two tasks:

• Duplicate strings are removed.
• Consecutive occurrences of the same character are aggregated into a smaller number of the same character.
[0038] The first task results in a set of unique strings. Duplicate strings do not add any value to the further processing as will be seen later.

[0039] It can be observed that subsequences of N, N > 1, identical events are quite frequent, with N exhibiting small variations. An example is the ftp login session, where the ftp daemon closes several file handles inherited from the inetd process. Since the inetd process is not always in the same state, the number of its file handles may vary. As a consequence, the ftp daemon does not always inherit the same number of file handles. Closing all the unneeded file handles therefore results in a varying number of file close operations.

[0040] There are two possible ways to aggregate characters:

• The identical consecutive characters are replaced with an extra, not yet used character.
• The N identical consecutive characters are comprised into M, $1 \le M \le N$, characters.
[0041] The first approach increases the number of unique events and possibly also the number of patterns. Since the number of patterns should be kept small, the second approach with M = 1 has been selected. The newly created strings have less semantics than the original ones, but no case is known where the character aggregation impacts the operation of the intrusion detection system.

[0042] The output of the reduction and aggregation component 111 are unique strings, labeled 112, where consecutive occurrences of the same character are removed. Figure 5 shows the strings of Figure 4 after being processed by reduction and aggregation component 111.
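The two tasks of the reduction and aggregation component, with M = 1, can be sketched in Python as follows. The function names are hypothetical, and removing duplicates after aggregation (rather than before) is an assumption chosen so that the output strings are unique.

```python
from itertools import groupby


def aggregate(s):
    """Collapse each run of identical consecutive characters to one character (M = 1)."""
    return "".join(ch for ch, _ in groupby(s))


def reduce_and_aggregate(strings):
    """Aggregate every string, then drop duplicates while keeping first-seen order."""
    seen = set()
    unique = []
    for s in strings:
        a = aggregate(s)
        if a not in seen:
            seen.add(a)
            unique.append(a)
    return unique


reduced = reduce_and_aggregate(["ABBBC", "ABC", "ABBCCA"])
```

In this toy input the first two strings differ only in the number of repeated `B` characters, so they collapse to the same string and only one copy is kept.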

[0043] A pattern extraction component 113 determines the patterns which constitute the process model and stores them in a pattern table 135. The algorithms used to build pattern table 135 are explained in detail in subsequent sections.

[0044] Pattern table 135 is a key part of the intrusion detection system according to the invention. It links the off-line system 102 with the on-line system 122.
[0045] The on-line part has nearly the same components as the off-line part. However, a main difference is that the process to be examined is not under control of the intrusion detection system. In on-line system 122, the event recording component 125 and the process filtering component 127 are the same as in the off-line system. The translation component 129 is different with respect to the fact that audit events are translated based on the entries retrieved from a translation table 134. If there is an event for which no entry in translation table 134 exists, the event is translated to a dummy character. For each event sequence, it has to be decided whether there is a sign of an intrusion or not. Therefore, there is no reduction component (like component 111 in the off-line part) that removes duplicate sequences as in the training system. There is only an aggregation component 131.
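The on-line translation with a dummy character can be sketched as below. The function name `translate_online`, the choice of `?` as dummy character, and the sample table are assumptions for illustration.

```python
def translate_online(events, table, dummy="?"):
    """Translate events with a fixed table; events without an entry become `dummy`."""
    return "".join(table.get(event, dummy) for event in events)


# Hypothetical table produced during training.
table = {("ftpd", "open"): "A", ("ftpd", "close"): "B"}
online = translate_online(
    [("ftpd", "open"), ("ftpd", "exec"), ("ftpd", "close")], table)
```

Since the dummy character never occurs in any trained pattern, an unknown event necessarily remains uncovered during pattern matching.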

[0046] The pattern matching component 133 receives its input strings 132 from the aggregation component 131. By applying the algorithm described in the subsequent section of this description, an attempt is made to match all the input strings with the patterns of pattern table 135. However, there may be strings that remain with uncovered characters. Depending on the number of consecutively uncovered characters, it is decided whether there is an indication of an intrusion, and whether an alarm 136 has to be issued.

[0047] For the ftp daemon, a threshold of 6 characters was selected, i.e. if there are 7 consecutive uncovered characters, an alarm is issued.
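The alarm rule can be expressed as a check on the longest run of consecutively uncovered characters. In this Python sketch the coverage of an input string is assumed to be given as a list of booleans (one per character); the function names are hypothetical.

```python
def max_uncovered_run(covered):
    """Length of the longest run of consecutive uncovered (False) positions."""
    longest = run = 0
    for is_covered in covered:
        run = 0 if is_covered else run + 1
        longest = max(longest, run)
    return longest


def should_alarm(covered, threshold=6):
    """Raise an alarm when more than `threshold` consecutive characters are uncovered."""
    return max_uncovered_run(covered) > threshold


six_gap = [True] + [False] * 6 + [True]
seven_gap = [True] + [False] * 7 + [True]
```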



2. Algorithms

[0048] In this section, a sample algorithm to build the pattern table and a sample algorithm to cover the input stream are described. They represent the best algorithms we know so far. However, we can think of variations of these algorithms. For example, the algorithm to build the pattern table sorts the patterns based on the number of characters they can cover at the beginning and end of an input string. We can think of other sort criteria like the total number of characters covered by a pattern or the number of occurrences of a pattern.

2.1 Terminology and Notation

[0049] Consider a finite set of characters

$\Sigma = \{c_1, c_2, \ldots, c_k\}$. The set $\Sigma$ is called an alphabet.

To denote a string of $n$, $n > 0$, identical consecutive characters $c \in \Sigma$, we write $c^n$. The term $c^*$ denotes a string of identical consecutive characters of arbitrary length $l$, $l \ge 0$. The term $c^+$ denotes a string of identical consecutive characters of arbitrary length $m$, $m > 0$. To denote an arbitrary string of length $n$, $n > 0$, we write $\{.\}^n$. $\{.\}^*$ denotes an arbitrary string of arbitrary length $l$, $l \ge 0$, and $\{.\}^+$ denotes an arbitrary string of length $m$, $m > 0$.

[0050] The length of string $s$ is written as $|s|$. We write $c \in s$ if the character $c$ is contained in the string $s$.
[0051] Given is a set of strings

$S = \{s_1, s_2, \ldots, s_t\}$

over the alphabet $\Sigma$. A substring $p$ that

• occurs at least twice in the set of strings $S$, and
• has a length $|p|$ of two or more characters

is called a pattern.
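The definition of a pattern can be checked by brute force on small inputs. The following Python sketch is not the Teiresias algorithm; it simply enumerates all substrings, which is only practical for toy examples. The function name `find_patterns` is hypothetical, and counting overlapping occurrences separately is an assumption.

```python
from collections import Counter


def find_patterns(strings, min_len=2):
    """Return every substring of length >= min_len that occurs at least twice
    in the string set, together with its number of occurrences."""
    counts = Counter()
    for s in strings:
        for i in range(len(s)):
            for j in range(i + min_len, len(s) + 1):
                counts[s[i:j]] += 1  # one occurrence per start position
    return {p: n for p, n in counts.items() if n >= 2}


patterns = find_patterns(["ABCAB", "XAB"])
```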

[0052] $p^n$ denotes the pattern $p$ repeated $n$, $n > 0$, times, $p^*$ denotes the pattern $p$ repeated $l$, $l \ge 0$, times, and $p^+$ denotes the pattern $p$ repeated $m$, $m > 0$, times.
[0053] A pattern $p$ is maximal, if there is no pattern $q$ for which holds:

• $p$ is a substring of $q$ with $|p| < |q|$, and
• the number of occurrences of the pattern $q$ in $S$ is equal to or larger than the number of occurrences of the pattern $p$ in $S$.
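The maximality condition can be applied as a filter over a table of pattern counts. This Python sketch assumes the counts are already available as a dictionary from pattern to number of occurrences; the name `maximal_patterns` and the sample counts are hypothetical.

```python
def maximal_patterns(counts):
    """Keep a pattern p unless some longer pattern q contains p
    and occurs at least as often as p."""
    return {
        p: n for p, n in counts.items()
        if not any(len(q) > len(p) and p in q and m >= n
                   for q, m in counts.items())
    }


counts = {"AB": 3, "ABC": 2, "BC": 2}
maximal = maximal_patterns(counts)
```

Here `BC` is dropped because the longer pattern `ABC` contains it and occurs equally often, while `AB` is kept because it occurs more often than `ABC`.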

[0054] A character $c \in s$ is said to be covered by the pattern $p$, if $c \in p$ and $p$ is a substring of $s$.
[0055] A string $s$ is said to be covered by a set of patterns $P$ if for each character $c$, $c \in s$, there is a pattern $p$, $p \in P$, so that $c$ is covered by $p$.
[0056] A set of strings $S$ is said to be covered by a set of patterns $P$, if each string $s$, $s \in S$, is covered by $P$. Additionally $P$ is said to cover $S$.
[0057] Given are a pattern $p$ and a string $s$. Let us decompose the string $s$ as follows:

$s = p^{l} \{.\}^{*} p^{r}, \quad l, r \ge 0$

It is assumed that the decomposition is maximal, i.e. there are no $l'$ and $r'$ for which holds

$l' + r' > l + r.$



[0058] The expression $(l + r) \cdot |p|$, i.e. the sum $l + r$ times the pattern length $|p|$, is called margin cover of the pattern $p$ and the string $s$. It is written as mCover(p,s). The margin cover of a pattern $p$ and a string set

$S = \{s_1, s_2, \ldots, s_t\}$,

written as mCover(p,S), is defined as

$\sum_{i=1}^{t} \mathrm{mCover}(p, s_i)$.
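The margin cover can be computed directly from the decomposition above. The following Python sketch tries every admissible number of leading repetitions and counts the trailing repetitions in the remainder, so that the sum l + r is maximal and the two margins do not overlap. The function names `m_cover` and `m_cover_set` are hypothetical.

```python
def _prefix_reps(p, s):
    """Number of back-to-back copies of p at the start of s."""
    k = 0
    while p and s.startswith(p, k * len(p)):
        k += 1
    return k


def m_cover(p, s):
    """(l + r) * |p| for the maximal decomposition s = p^l {.}* p^r."""
    best = 0
    for l in range(_prefix_reps(p, s) + 1):
        rest = s[l * len(p):]
        # Trailing copies of p in `rest` are leading copies in the reversed strings.
        r = _prefix_reps(p[::-1], rest[::-1])
        best = max(best, l + r)
    return best * len(p)


def m_cover_set(p, strings):
    """Margin cover of p over a string set: the sum over all strings."""
    return sum(m_cover(p, s) for s in strings)
```

For example, for p = `AB` and s = `ABABXAB` the decomposition has l = 2 and r = 1, giving a margin cover of 6, whereas an occurrence that touches neither end of the string contributes nothing.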



[0059] The total cover of a pattern $p$ and a string set $S$, tCover(p,S), is the total number of characters that can be covered by the pattern $p$.

2.2 Determining the Set of Maximal Variable-Length 
Patterns 

[0060] In a first step, all maximal patterns contained in the set S of input strings have to be determined. For this purpose, the Teiresias algorithm, as described by I. Rigoutsos and A. Floratos in "Combinatorial Pattern Discovery in Biological Sequences - The TEIRESIAS Algorithm" in Bioinformatics, pp. 55-67, Vol. 14, No. 1, 1998, is used. A minimal pattern length m can be specified as argument for the Teiresias algorithm. The Teiresias algorithm will then only find the maximal patterns whose length is equal to or greater than this given minimal length.
[0061] Part a) of Figure 6 shows a sample input set of 3 strings and part b) shows the corresponding pattern set as discovered by the Teiresias algorithm (with m = 2). For each pattern, the total number of occurrences (first column) as well as the number of strings in which it occurs (second column) is given.

2.3 Reducing the Set of Patterns 

[0062] Out of the set of patterns $P$ consisting of all the maximal patterns found for the string set $S$, a subset of patterns $R$, $R \subseteq P$, is selected which covers $S$. As an example, the following algorithm can be used to build the reduced pattern set $R$.

1. Let $m$ denote the minimal pattern length that was used to generate the set of maximal variable-length patterns. Add each $s$,

$s \in S \wedge |s| < 2 \cdot m$,

to $P$ and remove it from $S$.

2. If $P = \emptyset$, then add all $s \in S$ to the reduced pattern set $R$ and exit.

3. For each $p \in P$ calculate mCover(p,S).

4. If there is a pattern $p$ fulfilling mCover(p,S) > 0, then select a pattern $r$ for which mCover(r,S) is maximal, i.e. there is no pattern $q$ for which holds:

mCover(q,S) > mCover(r,S) or
mCover(q,S) = mCover(r,S) $\wedge$ $|q| > |r|$.

5. Add $r$ to the reduced pattern set $R$ and remove it from $P$.

6. Remove first all the matching substrings adjacent to the beginning and end of a string, i.e. remove strings of the form

$s = r^{+}$

and replace strings of the form

$s = r^{+} s'$ or $s = s'' r^{+}$

with $s'$ or $s''$, respectively.

7. Remove then the matching substrings that are not adjacent to the beginning and end of a string, i.e. as long as there is an $s \in S$ of the form $s = s' r^{+} s''$ with $|s'| \ge v$ and $|s''| \ge w$, $m$ being the minimal pattern length and $v, w > 0$, replace $s$ with the two strings $s'$ and $s''$. $v$ and $w$ specify the minimal length of the resulting new strings $s'$ and $s''$, respectively. Setting them equal to $m$ enforces that all patterns have a minimal length $m$. However, different settings are possible.

8. If one of the strings s that have been newly
added to S has a length $|s| < 2 \cdot m$, m being the
minimal pattern length, remove s from the set of
strings S and add it to the set of patterns P.

9. If $S \neq \emptyset$, go to step 2.
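
Two of the steps above lend themselves to a short illustration: the selection rule of step 4 and the removal of matches adjacent to the beginning and end of a string in step 6. The following is a minimal sketch under stated assumptions, not the complete reduction algorithm. The mCover values are taken as given input (they are the values as they appear in Figure 7, not recomputed here), the names `select_pattern` and `strip_adjacent` are hypothetical, and repeated copies of r at either end are assumed to be removed together.

```python
# mCover values as listed for the first reduction step in Figure 7
# (taken as given; mCover itself is not recomputed in this sketch).
FIG7_MCOVER = {"BC": 4, "DE": 2, "EA": 2, "ABCD": 8, "DEA": 3, "FDE": 3}


def select_pattern(m_cover):
    """Step 4: pick the pattern with maximal mCover, longer pattern on ties.

    Returns None if no pattern has an mCover value greater than 0.
    """
    candidates = {p: c for p, c in m_cover.items() if c > 0}
    if not candidates:
        return None
    return max(candidates, key=lambda p: (candidates[p], len(p)))


def strip_adjacent(s, r):
    """Step 6: remove copies of r adjacent to the beginning and end of s.

    Assumption: several consecutive copies of r are all removed. An empty
    result means the whole string was covered by r and can be dropped.
    """
    while r and s.startswith(r):
        s = s[len(r):]
    while r and s.endswith(r):
        s = s[:-len(r)]
    return s


if __name__ == "__main__":
    r = select_pattern(FIG7_MCOVER)
    print("selected:", r)
    for s in ["ABCDEA", "BCFDEABCD", "BCEADEFDE"]:
        print(s, "->", strip_adjacent(s, r))
```

With the Figure 7 values, "ABCD" is selected; stripping it from the sample strings leaves "EA", "BCFDE", and the unchanged third string.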

[0063] Figure 7 illustrates the pattern reduction algorithm
applied to the sample string set introduced in Figure 6.
For each reduction step, the string set, the pattern
set, and the reduced pattern set are shown. For each
pattern in the pattern set its mCover value is listed. The
pattern with the highest mCover value is moved to the
reduced pattern set and matching substrings are
removed from the string set. In this example, not all
patterns are needed to cover all the strings.
[0064] In the example of the ftp daemon, the Teiresias
algorithm determines about 600 maximal variable-length
patterns. After applying the reduction algorithm,
about 50 patterns remain.



2.4 Pattern Matching 






[0065] We describe a sample pattern matching algorithm.
The algorithm tries to match the input stream by
concatenating patterns, i.e. the patterns are placed one
right after the other. A variation would be to allow
overlapping patterns.
[0066] At certain points of the pattern matching process,
there may be several patterns that match the input
stream and it has to be decided which pattern to select.
As a heuristic, a pattern is selected if a sequence of d,
d > 0, patterns can be found that matches the input
stream right after the pattern under consideration.

1. Set the counter of consecutively uncovered
characters, u, to 0.

2. Wait until there are at least

$k = d \cdot |p_{mean}|$

characters in the input stream, where d is the
parameter as explained in the introduction to this
algorithm and $|p_{mean}|$ is the mean length of all
patterns $p \in P$, or until the end of an input sequence
has been reached.

3. Find a pattern $p \in P$ that covers the beginning of
the input stream. If no pattern can be found, go to
step 6.

4. Find d > 0 patterns $q_1, q_2, \ldots, q_d$, so that the
string

$t = p\,q_1 q_2 \ldots q_d$

covers the beginning of the stream. If there are e
patterns $q_1, q_2, \ldots, q_e$, 0 < e < d, that cover the
whole input sequence, set

$t = p\,q_1 q_2 \ldots q_e$.

(a) If t matches the whole input sequence, remove
the input sequence and go to step 1. (b) If d
patterns can be found covering the beginning of the
input stream, remove the pattern p from the input
stream and go to step 1.

5. Determine all pattern combinations that cover
the beginning of the input stream. Select the pattern
combination that covers the longest character
sequence, remove it from the input stream, and go
to step 1.

6. Skip one character and increase u by 1. 

7. If

$u = n + 1$,

n being the threshold for the number of consecutively
uncovered characters, raise an alarm.

8. Go to step 2. 
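
The steps above can be sketched in a few lines for the simplified case in which the complete input sequence is already available, so that the waiting of step 2 is not needed. This is an illustrative sketch under assumptions, not a definitive implementation: the name `match_stream`, its return value, and the default threshold are hypothetical, and the alarm condition is read as "u exceeds the threshold n". The pattern table is the one shown in Figures 8 and 9.

```python
from functools import lru_cache

# Pattern table shown at the top of Figures 8 and 9.
TABLE = ("ABCD", "XYZD", "ABC", "GHI", "XYZ", "HF")


def match_stream(text, patterns, d=3, threshold=5):
    """Return (skipped characters, alarm raised) for a complete sequence."""
    patterns = tuple(p for p in patterns if p)  # ignore empty patterns

    @lru_cache(maxsize=None)
    def chain(pos, depth):
        # Step 4: True if `depth` patterns can be concatenated from pos,
        # or if fewer patterns already reach the end of the sequence.
        if pos == len(text) or depth == 0:
            return True
        return any(text.startswith(q, pos) and chain(pos + len(q), depth - 1)
                   for q in patterns)

    @lru_cache(maxsize=None)
    def longest(pos):
        # Step 5: length of the longest run of concatenated patterns at pos.
        return max((len(q) + longest(pos + len(q))
                    for q in patterns if text.startswith(q, pos)), default=0)

    pos, u, skipped, alarm = 0, 0, [], False
    while pos < len(text):
        candidates = [p for p in patterns if text.startswith(p, pos)]  # step 3
        if not candidates:
            skipped.append(text[pos])  # step 6: skip one character
            pos += 1
            u += 1
            if u > threshold:          # step 7 (assumed reading)
                alarm = True
            continue
        validated = next((p for p in candidates
                          if chain(pos + len(p), d)), None)            # step 4
        pos += len(validated) if validated else longest(pos)           # step 5
        u = 0                                                          # step 1
    return "".join(skipped), alarm


if __name__ == "__main__":
    print(match_stream("ABCABCDXYZGHI", TABLE))  # fully covered string
    print(match_stream("ABCABCDKMWHF", TABLE))   # partially covered string
```

Applied to the two sample strings of Figures 8 and 9, the sketch covers the first string completely and skips the characters "KMW" in the second one, which agrees with the two illustrations.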

[0067] Figures 8 and 9 are two illustrations of the pattern
matching algorithm, one for the case of a fully-covered
string, one for the case of a partially-covered
string.

[0068] The pattern matching algorithm processes the
pattern table found on top of Figures 8 and 9. In Figure
8, the matching algorithm first finds the pattern "ABC"
matching the beginning of the sample string. Before
accepting this pattern as a valid match, the pattern
"ABC" must be validated by finding d=3 patterns matching
the remainder "ABCDXYZGHI" of the string. Since
three such patterns can be found, namely "ABCD",
"XYZ", and "GHI", the pattern "ABC" is accepted and
deleted from the sample string. Processing continues
with the string "ABCDXYZGHI".
[0069] In Figure 9, we have again the same pattern
table as in Figure 8. However, the sample string to be
matched is somewhat different. Again, the pattern
"ABC" is selected as candidate to match the beginning
of the sample string. However, we cannot find d=3
patterns that match the remainder "ABCDKMWHF" of the
string. This implies that another pattern that matches
the beginning of the sample string should be tried.
Since there is no such other pattern, the pattern
sequence has to be found that matches the longest
portion of the sample string. Because the two patterns
"ABC" and "ABCD" match the longest subsequence,
they are selected. The characters "KMW" cannot be
matched and are skipped. Processing continues with
the string "HF".

[0070] A pattern-oriented, behavior-based, variable-length
intrusion detection model was defined. The main
advantage of this inventive model is that it generates a
kind of "natural" patterns or signatures of the process to
be monitored, which patterns very well represent the
"normal" behavior. Thus, deviations - which indicate
intrusion or misuse - can be detected more easily than with
previously known methods.

[0071] While the present invention has been particularly
shown with reference to one specific embodiment,
it is obvious to someone skilled in the art that it can be
adapted to match the environment in which it is going to
be used, whether for the detection of unauthorized
transactions in the banking arena, of viruses in a
computer network, of unauthorized entry to buildings with
restricted access, or of unallowed data exchange
between data bases, to name a few.



Claims

1. A method for detecting intrusion attempts in a
computer or computer system, said method comprising
in combination:

in a training mode, building a table of characteristic,
process-constituting patterns defining
normal behavior of a model process in said
computer or computer system by performing
the following steps:

• building a first event sequence by filtering a
first event stream generated by said model
process,

• by using the so-called Teiresias algorithm,
extracting event sequence patterns from
said first event sequence, said patterns
constituting said model process,

• storing said process-constituting patterns;
and

in an operation mode, extracting characteristic
patterns from an actual process by performing
the following steps:

• building a second event sequence by filtering
an event stream generated by said
actual process,

• matching said second event sequence with
said stored process-constituting patterns,
and

• indicating the result of said matching step.

2. The method for intrusion detection according to
claim 1, wherein

• in the training mode, the first event sequence is
translated and the rules used for said translation
are stored,

• in the operation mode, the second event
sequence is translated using said stored
translation rules.

3. The method for intrusion detection according to
claim 2, wherein the translation is a dynamic,
on-the-fly translation.

4. The method for intrusion detection according to any
of the claims 1 to 3, wherein

• in the training mode, the first event sequence is
compressed according to a given set of
aggregation rules,

• in the operation mode, the second event
sequence is compressed using said given set
of aggregation rules.

5. The method for intrusion detection according to one
or more of the preceding claims, wherein

• an event stream generated by either one or
both of the processes is recorded and
said recorded event stream is filtered to build the
first and/or second event sequence.

6. The method for intrusion detection according to one
or more of the preceding claims, wherein the
process-constituting patterns of the first event
sequences contain patterns of varying lengths, in
particular patterns of maximal lengths.

7. The method for intrusion detection according to one
or more of the preceding claims, further including a
reduction step in the training mode, whereby any
duplications in the obtained event sequence are
removed.

8. The method for intrusion detection according to one
or more of the preceding claims, wherein the training
mode is carried out under the control of the
intrusion detection system.

9. The method for intrusion detection according to one
or more of the claims 2 to 7, wherein all or part of
the method steps are applied in the following
sequences

in the training mode:

1. event recording,

2. process filtering,

3. translation and storage of translation
rules,

4. reduction and aggregation,

5. pattern extraction and storage; and

in the operational mode:

1. event recording,

2. process filtering,

3. translation based on stored translation
rules,

4. aggregation,

5. pattern matching with stored patterns.

10. An apparatus for detecting intrusion attempts in a
computer or computer system, said apparatus comprising
in combination:

• a first filtering component (107) for filtering, in a
training mode branch, a first event stream generated
by a model process (103) and building a
first event sequence (108),

• a pattern extraction component (113) extracting
event sequence patterns from said first
event sequence by using the so-called Teiresias
algorithm, said patterns constituting said
model process,

• a pattern table component (135) storing said
extracted, process-constituting patterns defining
normal behavior of said model process,

• a second filtering component (127) for building,
in an operation mode branch, a second event
sequence by filtering an event stream generated
by an actual process (123),

• a pattern matching component (133) for matching
said second event sequence with said
process-constituting patterns stored in said pattern
table component (135), and

• an indicator component (136) for indicating the
output of said matching component (133).

11. The apparatus for intrusion detection according to
claim 10, further comprising

• a first translation component (109) which, in the
training mode branch, translates the first event
sequence,

• a translation table (134) for storing the translation
rules used for said translation,

• a second translation component (129) which, in
the operation mode branch, translates the second
event sequence, using said translation
rules stored in said translation table (134).

12. The apparatus for intrusion detection according to
any of the claims 10 and 11, further comprising

• a first compression component (111) which, in
the training mode branch, compresses the first
event sequence according to a given set of
aggregation rules, and

• a second compression component (131)
which, in the operation mode branch, compresses
the second event sequence using said
given set of aggregation rules.

13. The apparatus for intrusion detection according to
claim 12, further comprising a reduction component
for removing duplicates in the event sequence
obtained in the training mode branch, in particular a
reduction component combined with the first
compression component (111).

14. The apparatus for intrusion detection according to
one or more of the claims 10 to 13, wherein all or
part of the components are arranged in the following
working sequences

in the training mode branch:

1. event recording component (105),

2. process filtering component (107),

3. translation component (109) and
translation table (134),

4. reduction and aggregation component
(111),

5. pattern extraction component (113) and
pattern table (135); and

in the operational mode branch:

1. event recording component (125),

2. process filtering component (127),

3. translation component (129), connected
to said translation table (134),

4. aggregation component (131),

5. pattern matching component (133),
connected to said pattern table (135).

15. The apparatus for intrusion detection according to
one or more of the preceding claims, wherein all or
part of the components are arranged such that, to
avoid duplication, they can be used alternatively
either in the training mode branch or the operation
mode branch.

Process   Event            Process Id

ftpd      FILE_Close       16415
ls        PROC_Execute     16415
ls        FILE_Close       16415
fingerd   PROC_Execute     18210
ls        PROC_Delete      16415
fingerd   PROC_SetSignal   18210
ftpd      PROC_Create      18303
ftpd      FILE_Close       18303
ftpd      FILE_Close       18303
ftpd      FILE_Close       18303
fingerd   FILE_Read        18210
fingerd   PROC_Create      18210
ftpd      FILE_Close       18303
ftpd      PROC_SetSignal   18303
ftpd      FILE_Read        18303
ftpd      FILE_Read        18303
ftpd      FILE_Write       18303
ftpd      PROC_Delete      18303
fingerd   PROC_Execute     19415
fingerd   PROC_SetSignal   19415
fingerd   FILE_Read        19415
fingerd   PROC_Create      19415

Fig. 2

0: (ftpd, FILE_Close), (ls, PROC_Execute),
   (ls, FILE_Close), (ls, PROC_Delete)

1: (fingerd, PROC_Execute),
   (fingerd, PROC_SetSignal), (fingerd, FILE_Read),
   (fingerd, PROC_Create)

2: (ftpd, PROC_Create), (ftpd, FILE_Close),
   (ftpd, FILE_Close), (ftpd, FILE_Close),
   (ftpd, FILE_Close), (ftpd, PROC_SetSignal),
   (ftpd, FILE_Read), (ftpd, FILE_Read),
   (ftpd, FILE_Write), (ftpd, PROC_Delete)

3: (fingerd, PROC_Execute),
   (fingerd, PROC_SetSignal),
   (fingerd, FILE_Read), (fingerd, PROC_Create)

Fig. 3



0: ABCD
1: EFGH
2: IAAAAJKKLM
3: EFGH

Fig. 4

0: ABCD
1: EFGH
2: IAJKLM

Fig. 5



a)  0: ABCDEA          b)  4  3  BC
    1: BCFDEABCD           4  3  DE
    2: BCEADEFDE           3  3  EA
                           2  2  ABCD
                           2  2  DEA
                           2  2  FDE


Fig. 6

Strings        Pattern set (mCover)    Reduced pattern set

ABCDEA         BC    4
BCFDEABCD      DE    2
BCEADEFDE      EA    2
               ABCD  8
               DEA   3
               FDE   3

(Further reduction steps: string set, pattern set with mCover values, and reduced pattern set after each step.)

Fig. 7

Table:   ABCD  XYZD  ABC  GHI  XYZ  HF
String:  ABCABCDXYZGHI

ABC                                --> Selectable
Remainder of string: ABCDXYZGHI    --> ABCD
Remainder of string: XYZGHI        --> XYZ
Remainder of string: GHI           --> GHI
ABC                                --> Validated

Fig. 8

Table:   ABCD  XYZD  ABC  GHI  XYZ  HF
String:  ABCABCDKMWHF

ABC                                --> Selectable
Remainder of string: ABCDKMWHF
Possible combinations: ABC/ABC     --> Length 6
                       ABC/ABCD    --> Length 7 --> Selected
K                                  --> Shifted, u = 1
M                                  --> Shifted, u = 2
W                                  --> Shifted, u = 3
HF                                 --> u = 0

Fig. 9